dynamic benchmark
VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models
We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluating instruction-following vision-language models for real-world use. Our starting point is curating 70 "instruction families" that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors, e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps/potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparisons. VisIT-Bench is dynamic: to participate, practitioners simply submit their model's responses on the project website. Data, code, and leaderboard are available at https://visit-bench.github.io/.
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
Guo, Jiacheng, Huang, Suozhi, Yao, Zixin, Zhang, Yifan, Lu, Yifu, Liu, Jiashuo, Li, Zihao, Deng, Nicholas, Xiao, Qixin, Tian, Jia, Zhan, Kanghong, Li, Tianyi, Liu, Xiaochen, Ge, Jason, He, Chaoyang, Huang, Kaixuan, Yang, Lin, Huang, Wenhao, Wang, Mengdi
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
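The four-quadrant scoring described above can be illustrated with a minimal sketch. It assumes each graded answer is recorded as a (quadrant, correct) pair and that the retrieval-prediction imbalance is summarized as a simple accuracy difference; the record format, the function names, and the gap metric are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Quadrant names as listed in the abstract.
QUADRANTS = (
    "Simple Retrieval",
    "Complex Retrieval",
    "Simple Prediction",
    "Complex Prediction",
)


def accuracy_by_quadrant(results):
    """Return {quadrant: accuracy} for (quadrant, correct) records."""
    totals = defaultdict(lambda: [0, 0])  # quadrant -> [num_correct, num_total]
    for quadrant, correct in results:
        if quadrant not in QUADRANTS:
            raise ValueError(f"unknown quadrant: {quadrant}")
        totals[quadrant][0] += int(correct)
        totals[quadrant][1] += 1
    return {q: c / n for q, (c, n) in totals.items()}


def retrieval_prediction_gap(results):
    """Accuracy on retrieval tasks minus accuracy on prediction tasks.

    Assumed summary statistic: a large positive value indicates a model
    that retrieves facts well but predicts poorly.
    """
    def accuracy(kind):
        outcomes = [ok for quadrant, ok in results if quadrant.endswith(kind)]
        return sum(outcomes) / len(outcomes)

    return accuracy("Retrieval") - accuracy("Prediction")
```

For example, a model that answers both retrieval questions correctly but only one of two prediction questions would have a gap of 0.5 under this assumed metric.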
Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety
Nair, Variath Madhupal Gautham, Dantuluri, Vishal Varma
Existing large language models (LLMs) are advancing rapidly and produce outstanding results in image generation tasks, yet their content safety checks remain vulnerable to prompt-based jailbreaks. Through preliminary testing on platforms such as ChatGPT, MetaAI, and Grok, we observed that even short, natural prompts could lead to the generation of compromising images ranging from realistic depictions of forged documents to manipulated images of public figures. We introduce Unmasking the Canvas (UTC Benchmark; UTCB), a dynamic and scalable benchmark dataset to evaluate LLM vulnerability in image generation. Our methodology combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation using Groq-hosted LLaMA-3. The pipeline supports both zero-shot and fallback prompting strategies, risk scoring, and automated tagging. All generations are stored with rich metadata and curated into Bronze (non-verified), Silver (LLM-aided verification), and Gold (manually verified) tiers. UTCB is designed to evolve over time with new data sources, prompt templates, and model behaviors. Warning: This paper includes visual examples of adversarial inputs designed to test model safety. All outputs have been redacted to ensure responsible disclosure.
Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Chen, Simin, Chen, Yiming, Li, Zexin, Jiang, Yifan, Wan, Zhongwei, He, Yixin, Ran, Dezhi, Gu, Tianle, Li, Haizhou, Xie, Tao, Ray, Baishakhi
Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static-to-dynamic benchmarking methods aimed at reducing data contamination risks. We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap: the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts. We maintain a GitHub repository to continuously collect both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.
Reviews: A Robust Non-Clairvoyant Dynamic Mechanism for Contextual Auctions
It is quite unclear, since in [Medina & Mohri, 2014] the benchmark is the best possible one: it equals exactly the valuation of the buyer and hence generates the maximal revenue in each round. So no dynamic pricing can provide higher revenue than this benchmark. The same issue occurs in Lines 81-83. Comment after rebuttal: I understand the answer in general. I hope the authors will improve the clarity of the lines that I have indicated above.
Dynamic Benchmarks: Spatial and Temporal Alignment for ADS Performance Evaluation
Chen, Yin-Hsiu, Scanlon, John M., Kusano, Kristofer D., McMurry, Timothy L., Victor, Trent
Deployed SAE level 4+ Automated Driving Systems (ADS) without a human driver are currently operating in ride-hailing fleets on surface streets in the United States. This current use case and future applications of this technology will determine where and when the fleets operate, potentially resulting in a divergence from the distribution of driving of some human benchmark population within a given locality. Existing benchmarks for evaluating ADS performance have only done county-level geographical matching of the ADS and benchmark driving exposure in crash rates. This study presents a novel methodology for constructing dynamic human benchmarks that adjust for spatial and temporal variations in driving distribution between an ADS and the overall human-driven fleet. Dynamic benchmarks were generated using human police-reported crash data, human vehicle miles traveled (VMT) data, and over 20 million miles of Waymo's rider-only (RO) operational data accumulated across three US counties. The spatial adjustment revealed significant differences across various severity levels in adjusted crash rates compared to unadjusted benchmarks, with these differences ranging from 10% to 47% higher in San Francisco, 12% to 20% higher in Maricopa, and 7% lower to 34% higher in Los Angeles counties. The time-of-day adjustment in San Francisco, limited to this region due to data availability, resulted in adjusted crash rates 2% lower to 16% higher than unadjusted rates, depending on severity level. The findings underscore the importance of adjusting for spatial and temporal confounders in benchmarking analysis, which ultimately contributes to a more equitable benchmark for ADS performance evaluations.
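One plausible reading of the spatial adjustment is exposure reweighting: compute the human crash rate separately in each zone, then average those rates using the ADS's share of miles in each zone as weights. The sketch below illustrates that idea only. The reweighting formula, the zone names, and all numbers are assumptions made for illustration and are not taken from the study.

```python
def unadjusted_rate(human_crashes, human_vmt):
    """County-wide human crash rate: total crashes / total miles."""
    return sum(human_crashes.values()) / sum(human_vmt.values())


def spatially_adjusted_rate(human_crashes, human_vmt, ads_miles):
    """Human crash rate reweighted to the ADS's driving distribution.

    Each zone's human crash rate is weighted by the fraction of ADS
    miles driven in that zone (assumed adjustment, for illustration).
    """
    total_ads_miles = sum(ads_miles.values())
    return sum(
        (ads_miles[zone] / total_ads_miles)
        * (human_crashes[zone] / human_vmt[zone])
        for zone in ads_miles
    )


# Hypothetical data: crashes and miles (in millions) for two zones.
human_crashes = {"downtown": 60, "suburb": 40}
human_vmt = {"downtown": 10.0, "suburb": 40.0}
ads_miles = {"downtown": 3.0, "suburb": 1.0}  # ADS drives mostly downtown

baseline = unadjusted_rate(human_crashes, human_vmt)
adjusted = spatially_adjusted_rate(human_crashes, human_vmt, ads_miles)
print(f"unadjusted: {baseline:.2f}, adjusted: {adjusted:.2f} crashes per million miles")
```

In this made-up example the adjusted benchmark is higher than the unadjusted one because the ADS concentrates its miles in the zone where human crash rates are highest, which matches the direction of most of the differences reported above.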
Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models
Kurtic, Eldar, Moeini, Amir, Alistarh, Dan
The ability of large language models (LLMs) to approach non-trivial tasks involving both information retrieval and mathematical reasoning has led to significant research interest in evaluating these properties. Yet, the popularity of reasoning benchmarks, such as the often-used Grade-School Math (GSM) [1] or MATH [2] datasets, is leading to performance saturation (see Figure 1), and can potentially lead to training set contamination. Thus, there is a stringent need to develop new strong benchmarks to evaluate LLM reasoning. We address this by proposing Mathador-LM, a new benchmark for examining the mathematical reasoning properties of LLMs. At a high level, Mathador-LM follows the popular Mathador mathematical game [3], in which a human player is given five base numbers together with a target number, and has to provide a series of calculations, each using one of the four basic arithmetic operations, which result in the target number.
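The game rules described above lend themselves to an automatic checker, sketched minimally below. The abstract specifies only the base numbers, the target, and the four operations. The step format (operand, operator, operand), the rule that each number or intermediate result may be used at most once, and the restriction to exact division and non-negative results are assumptions made for illustration.

```python
from collections import Counter

# The four basic arithmetic operations; division must be exact (assumption).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b if b != 0 and a % b == 0 else None,
}


def check_solution(base_numbers, target, steps):
    """Return True if `steps` is a valid derivation of `target`.

    `steps` is a list of (a, op, b) tuples. Operands must be unused base
    numbers or results of earlier steps (hypothetical format).
    """
    available = Counter(base_numbers)
    result = None
    for a, op, b in steps:
        if op not in OPS:
            return False
        needed = Counter([a, b])
        # Every operand must still be available, counting duplicates.
        if any(available[n] < k for n, k in needed.items()):
            return False
        available -= needed
        result = OPS[op](a, b)
        if result is None or result < 0:
            return False
        available[result] += 1  # the result can feed a later step
    return result == target


# Hypothetical instance: reach 42 from the base numbers 2, 3, 5, 7, 10.
steps = [(10, "*", 3), (30, "+", 7), (37, "+", 5)]
print(check_solution([2, 3, 5, 7, 10], 42, steps))
```

Because validity can be checked mechanically like this, fresh instances can be generated and graded without a fixed answer key.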
Dynamic Model Predictive Shielding for Provably Safe Reinforcement Learning
Banerjee, Arko, Rahmani, Kia, Biswas, Joydeep, Dillig, Isil
Among approaches for provably safe reinforcement learning, Model Predictive Shielding (MPS) has proven effective at complex tasks in continuous, high-dimensional state spaces, by leveraging a backup policy to ensure safety when the learned policy attempts to take risky actions. However, while MPS can ensure safety both during and after training, it often hinders task progress due to the conservative and task-oblivious nature of backup policies. This paper introduces Dynamic Model Predictive Shielding (DMPS), which optimizes reinforcement learning objectives while maintaining provable safety. DMPS employs a local planner to dynamically select safe recovery actions that maximize both short-term progress and long-term rewards. Crucially, the planner and the neural policy play a synergistic role in DMPS. When planning recovery actions for ensuring safety, the planner utilizes the neural policy to estimate long-term rewards, allowing it to observe beyond its short-term planning horizon. Conversely, the neural policy under training learns from the recovery plans proposed by the planner, converging to policies that are both high-performing and safe in practice. This approach guarantees safety during and after training, with bounded recovery regret that decreases exponentially with planning horizon depth. Experimental results demonstrate that DMPS converges to policies that rarely require shield interventions after training and achieve higher rewards compared to several state-of-the-art baselines.
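The basic shielding idea (the MPS baseline, not the DMPS planner) can be sketched on a toy problem: accept the learned policy's action only if the backup policy could still bring the system to a safe stop from the resulting state, and otherwise fall back to the backup action. The 1D cart dynamics, the wall position, the braking backup policy, and all function names below are assumptions made for illustration and do not come from the paper.

```python
WALL = 10  # positions above this value are unsafe (toy assumption)


def step(state, accel):
    """Toy dynamics: velocity changes by accel, then position by velocity."""
    x, v = state
    v_next = v + accel
    return (x + v_next, v_next)


def is_safe(state):
    return state[0] <= WALL


def backup_policy(state):
    """Brake toward zero velocity."""
    v = state[1]
    return -1 if v > 0 else (1 if v < 0 else 0)


def recoverable(state, horizon=10):
    """True if following the backup policy reaches a safe stop in time."""
    for _ in range(horizon):
        if not is_safe(state):
            return False
        if state[1] == 0:
            return True
        state = step(state, backup_policy(state))
    return is_safe(state) and state[1] == 0


def shielded_action(state, proposed):
    """Use the proposed action if it keeps the system recoverable."""
    next_state = step(state, proposed)
    if is_safe(next_state) and recoverable(next_state):
        return proposed
    return backup_policy(state)


def rollout(policy, shield, steps=20, start=(0, 0)):
    """Run the toy system and return the visited states."""
    state, visited = start, [start]
    for _ in range(steps):
        action = policy(state)
        if shield:
            action = shielded_action(state, action)
        state = step(state, action)
        visited.append(state)
    return visited


def reckless(state):
    """Stand-in for a learned policy that always accelerates."""
    return 1


print(max(x for x, _ in rollout(reckless, shield=True)))
```

In this toy setting the shielded rollout never crosses the wall while the unshielded one does. The backup action here ignores task progress entirely, which is the conservatism the abstract says DMPS addresses by planning recovery actions instead.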
Online Covering with Multiple Experts
Kevi, Enikő, Nguyen, Kim-Thang
Designing online algorithms with machine learning predictions is a recent technique beyond the worst-case paradigm for various practically relevant online problems (scheduling, caching, clustering, ski rental, etc.). While most previous learning-augmented algorithm approaches focus on integrating the predictions of a single oracle, we study the design of online algorithms with multiple experts. To go beyond the popular benchmark of a static best expert in hindsight, we propose a new dynamic benchmark (linear combinations of predictions that change over time). We present a competitive algorithm in the new dynamic benchmark with a performance guarantee of $O(\log K)$, where $K$ is the number of experts, for $0-1$ online optimization problems. Furthermore, our multiple-expert approach provides a new perspective on how to combine several online algorithms in an online manner, a long-standing central subject in the online algorithm research community.